
Collaborating Authors

Tacotron 2


Supplementary Material of Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search Appendix A

Neural Information Processing Systems

The detailed encoder architecture is depicted in Figure 7. We design the grouped 1x1 convolutions to be able to mix channels; Figure 8c shows an example. The decoder takes a mel-spectrogram and squeezes it. Then, the decoder processes it through a number of flow blocks.
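As a rough illustration of the squeeze and grouped 1x1 convolution operations described above, the NumPy sketch below squeezes a mel-spectrogram along time and mixes channels within groups via small invertible matrices. The shapes, group size, and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def squeeze(mel, factor=2):
    """Squeeze a mel-spectrogram [channels, time]: halve the time axis
    and double the channel axis (assumed squeeze convention)."""
    c, t = mel.shape
    t = (t // factor) * factor                      # drop ragged frames
    return (mel[:, :t].reshape(c, t // factor, factor)
                      .transpose(0, 2, 1)
                      .reshape(c * factor, t // factor))

def grouped_1x1_conv(x, weights):
    """Apply an invertible 1x1 convolution per channel group.
    `weights` is a list of small square matrices, one per group, so
    channels are mixed only within each group."""
    g = weights[0].shape[0]
    out = np.empty_like(x)
    for i, w in enumerate(weights):
        out[i * g:(i + 1) * g] = w @ x[i * g:(i + 1) * g]
    return out

mel = np.random.randn(80, 100)                      # 80 mel bins, 100 frames
z = squeeze(mel)                                    # shape (160, 50)
# Orthogonal matrices (from QR) are trivially invertible, as a flow requires.
ws = [np.linalg.qr(np.random.randn(4, 4))[0] for _ in range(z.shape[0] // 4)]
y = grouped_1x1_conv(z, ws)                         # channels mixed in groups of 4
```

Because each group matrix is orthogonal, the operation is exactly invertible by applying the transposes, which is what makes it usable inside a normalizing flow.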



Thanks to all the reviewers for the detailed and thoughtful comments

Neural Information Processing Systems

Thanks to all the reviewers for the detailed and thoughtful comments. There are HMM-based works [1, 2, 3], all of which proposed methods to estimate alignments from unsegmented data. We have not thoroughly explored improving the duration predictor and simply followed the same ... We design the grouped 1x1 convolutions to be able to mix channels. For example, to generate a speech of 5.8 ... Therefore, adopting parallel TTS models significantly improves the sampling speed of end-to-end systems. In Section 5.3, we showed that varying temperature can change ... We will add a reference about Viterbi training.
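The temperature control mentioned above (Section 5.3) is commonly realized in flow-based TTS by scaling the standard deviation of the prior before sampling the latent. The helper below is a hypothetical sketch of that idea, not the authors' code; the latent shape is an arbitrary assumption.

```python
import numpy as np

def sample_latent(mu, sigma, temperature=1.0, rng=None):
    """Sample z from N(mu, (temperature * sigma)^2).
    Lowering the temperature trades sample diversity for stability
    of the generated speech (hypothetical helper for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)
    return mu + temperature * sigma * eps

mu = np.zeros((80, 50))                   # assumed latent shape
sigma = np.ones((80, 50))
z_default = sample_latent(mu, sigma, temperature=1.0)
z_stable = sample_latent(mu, sigma, temperature=0.333)
```

A lower temperature concentrates samples near the prior mean, which typically yields flatter but more robust prosody, while temperature 1.0 preserves the full learned variability.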








Text to Speech System for Meitei Mayek Script

Irengbam, Gangular Singh, Wahengbam, Nirvash Singh, Khumanthem, Lanthoiba Meitei, Oinam, Paikhomba

arXiv.org Artificial Intelligence

This paper presents the development of a Text-to-Speech (TTS) system for the Manipuri language using the Meitei Mayek script. Leveraging Tacotron 2 and HiFi-GAN, we introduce a neural TTS architecture adapted to support tonal phonology and under-resourced linguistic environments. We develop a phoneme mapping for Meitei Mayek to ARPAbet, curate a single-speaker dataset, and demonstrate intelligible and natural speech synthesis, validated through subjective and objective metrics. This system lays the groundwork for linguistic preservation and technological inclusion of Manipuri.
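The phoneme mapping described above can be pictured as a lookup table from Meitei Mayek letters to ARPAbet symbols. The sketch below uses romanized letter names and a handful of illustrative consonant pairings; the paper's actual mapping, coverage, and tone handling are not reproduced here.

```python
# Illustrative grapheme-to-phoneme table keyed by romanized Meitei Mayek
# letter names; the specific pairings are assumptions for this sketch.
MEITEI_TO_ARPABET = {
    "KOK": "K", "SAM": "S", "LAI": "L", "MIT": "M",
    "PA": "P", "NA": "N", "CHIL": "CH", "TIL": "T",
}

def to_arpabet(letters):
    """Map a sequence of Meitei Mayek letter names to ARPAbet symbols,
    passing unknown letters through unchanged for later inspection."""
    return [MEITEI_TO_ARPABET.get(letter, letter) for letter in letters]

print(to_arpabet(["KOK", "NA"]))  # -> ['K', 'N']
```

Passing unknown letters through unchanged makes gaps in the table easy to spot when curating a dataset for an under-resourced script.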


Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models

Lee, Kyowoon, Stitsyuk, Artyom, Jho, Gunu, Hwang, Inchul, Choi, Jaesik

arXiv.org Artificial Intelligence

Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.
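The core idea of intervening on internal representations at inference time can be sketched with a toy network: compute a hidden activation, apply an edit function to it, and observe how the output changes. This is a generic stand-in for the model-agnostic mechanism described above, not the paper's method or a real TTS model.

```python
import numpy as np

def forward(x, w1, w2, edit=None):
    """Tiny two-layer network. `edit` is an optional function applied to
    the hidden activation, mimicking post-hoc activation editing on a
    pre-trained model's internals (toy illustration only)."""
    h = np.tanh(w1 @ x)
    if edit is not None:
        h = edit(h)                      # intervene on the representation
    return w2 @ h

rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((2, 8))
x = rng.standard_normal(4)

y_base = forward(x, w1, w2)
# Counterfactual: amplify one hidden unit (e.g. a unit hypothetically
# correlated with a prosodic attribute) and leave the rest untouched.
y_edit = forward(x, w1, w2,
                 edit=lambda h: np.where(np.arange(8) == 3, 2.0 * h, h))
```

Because the edit is applied only at inference, the pre-trained weights stay frozen, which is what makes this kind of correction "post-hoc" rather than a retraining step.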


Reviews: FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems

Originality: Although phoneme duration prediction is widely adopted in conventional TTS systems, jointly training it in a neural TTS model is new. This paper is one of the first works on non-autoregressive text-to-spectrogram modeling. Quality: This paper seems sound overall, except for a few issues in the comments below. Some of these issues must be addressed before acceptance. Clarity: A well-written paper. Significance: The advantages over its autoregressive counterparts are significant, especially for industrial use.